The purpose of this report is to demonstrate exploratory
data analysis and multivariate statistical techniques and their
associated R tools. It is not intended to make any conclusions about the
study and should not be cited for that purpose.
The data used in this report were obtained from the Environmental Data Initiative repository. These data were collected as part of a biweekly sampling campaign of four alpine lakes (i.e., Green Lakes 1, 4, and 5 and Lake Albion) within Niwot Ridge, CO (Loria and Yevak, 2019). Each lake was sampled at shoreline, inlet, and outlet sites for benthic macroinvertebrates (BMIs), which were identified to order, family, and, when possible, genus. Six continuous environmental variables were measured:
Five categorical environmental variables were also tracked:
The BMI and environmental data were wrangled, explored, and
visualized before analysis using two multivariate techniques: principal
components analysis (PCA) and non-metric multidimensional scaling
(NMDS).
The study area was dipteran dominant; this order (primarily chironomids) made up over 81% of all BMIs (Fig. 1). Caddisflies composed just over 9% of individuals, and the third most abundant order were freshwater leeches (order Rhynchobdellida) with nearly 5% of all specimens and a total abundance of 117 individuals.
Figure 1. Total numbers of benthic macroinvertebrates (BMIs) collected by order.
BMIs were most commonly found in creek sites followed by inlet and waterfall locations; each location averaged more than 200 individuals (Fig. 2A). Lake and outlet sites had far fewer BMIs with neither location averaging more than 60 specimens. BMI abundance varied by local_site with ALB as the most abundant (> 140 individuals/site) and GL1, the least abundant (Fig. 2B).
Figure 2. Average numbers of BMIs collected by project_site and either location or local_site.
Figure 3 illustrates how the various numerical environmental variables (including the binary fish_presence and lotic variables) varied among local_sites.
Figure 3. Average value of environmental variable by local_site (lake).
Clearly no fish were found in GL4 and GL5, and GL1 comprised only
lentic sites. DO and sat were much greater, on
average, in GL5 sites compared to the other local_sites. GL5
sites were located at the highest elevations and unsurprisingly had the
lowest temperatures (temp), nitrate, on average, was
much greater in ALB and GL5 sites relative to GL1 and GL4 sites.
The multi-panel graph shows the pairwise relationships among all numerical (i.e., non-binary, non-categorical) environmental variables with linear regression lines, correlations, and indicators of significance. As expected, DO and sat were strongly, positively correlated per their tight regression line and r > 0.96 (Fig. 4). temp was significantly, negatively correlated with elevation, sat, DO, and nitrate. elevation was significantly, positively correlated with sat and DO.
Figure 4. Pairwise relationship for all six continuous environmental variables are drawn. The lower left half of the matrix contains all pairwise scatterplots with regression lines. The upper right half comprises correlations and, if applicable, significance stars. Density plots of each individual variable are constructed along the diagonal.
Scatter plots conditioned on lotic status (i.e., 0 = lentic and 1 = lotic) were visualized (Fig. 5). A few of the significant correlations discussed in the previous figure are inconsistent between lotic categories. For instance, temp, sat, and DO were significantly correlated with elevation (temp negatively, and the other two variables positively); however, when conditioned by lotic status, these relationships strengthen when the locations were lentic but were non-significant when lotic. Other patterns occurred with lotic status as a conditioning variable. For example, ph and sat were weakly correlated (r = 0.149) but when lotic status was considered, these relationships strengthened and became significant for both lotic (r = 0.722) and lentic (r = 0.588) locations.
Figure 5. Pairwise relationship for all six continuous environmental variables are drawn and distinguished according to lotic status: lentic (lotic = 0) is colored in red and lotic (lotic=1) is colored in green. The lower left half of the matrix contains all pairwise scatterplots with regression lines according to lotic status. The upper right half comprises correlations and, if applicable, significance stars for all points and by lotic status. Density plots of each individual variable according to lotic status are constructed along the diagonal.
Pairwise relationships between BMI abundances and the six numeric environmental variables were visualized as part of the exploratory data analysis (Fig. 6).
Figure 6. Scatterplots with regression lines of each numeric environmental variable regressed against total BMI abudnance.
Only ph was found to be significant, with an r =
-0.58.
There are two assumptions of PCA per Shipley (2021): 1) normally (or
at least symmetrically) distributed variables and 2) linear
relationships among variables.
Distributions of variables were assessed visually and statistically. First, a series of Q-Q plots were constructed (Fig. 7), which showed that variables ph and nitrate were clearly not normally distributed.
Figure 7. Q-Q plots of each untransformed numeric enviornmental variable are presented.
These results were supported by Shapiro tests.
Only the Q-Q plot for sat displayed normality, which was supported by the Shapiro test result. The Q-Q plot for elevation was borderline non-signifcant, and the remaining four variables were clearly non-normally distributed, particularly nitrate and temp, which had highly significant (p < .01) Shapiro test results.
Five of the six environmental variables underwent a Box-Cox transformation (i.e., only sat was left untransformed), and the Q-Q plots and Shapiro tests were re-run.
Figure 8. Q-Q plots of six numeric environmental variables are presented. All variables but sat were Box-Cox transformed.
These results (Fig. 8; Table 2) indicate considerable improvement in the distributions of DO_trans and temp_trans and a slight improvement for elevation_trans (note: transformed variables will be indicated with “[variable name]*_trans). However, ph_trans* and nitrate_trans were still non-normal following Box-Cox transformations. Additional transformations (i.e., log, square-root, inverse) were calculated for these two variables, and none could achieve normality. Given that only a symmetrical distribution is needed for a PCA, symmetry was assessed by measuring skewness and running a Jarque-Bera Test, which measures whether data have skewness and kurtosis similar to a normal distribution.
## ph_trans nitrate_trans
## -0.03295884 -0.05704655
These results (Table 3) indicate that both ph_trans and nitrate_trans that underwent Box-Cox transformations had symmetrical distributions as neither test result was significant.
All numerical environmental variables were retained. All variables
but sat were Box-Cox transformed.
The second assumption of PCA is that the variables have linear relationships (Shipley, 2021). This was assessed visually using scatter plots and regression lines and quantitatively via correlations.
Figure 9. Pairwise relationship for all six continuous environmental variables are drawn. All variables but sat have undergone Box-Cox transformation. The lower left half of the matrix contains all pairwise scatterplots with regression lines. The upper right half comprises correlations and, if applicable, significance stars. Density plots of each individual variable are constructed along the diagonal.
Aside from a few variable pairs involving nitrate_trans (e.g., sat-nitrate_trans, elevation_trans-nitrate_trans, DO_trans-nitrate_trans), there tended to be linear relationships among pairs of variables. In the nitrate_trans scatter plots, there did not appear to be a linear (or even non-linear) relationship with sat, elevation_trans, or DO_trans, which was supported by weak correlations (-0.15 < r < 0.15).
Given that nitrate_trans does not have linear relationships
with all other environmental variables, it was dropped from the PCA.
A PCA was run on the five remaining continuous environmental variables (i.e., sat_trans, elevation_trans, temp_trans, DO_trans and ph_trans), and all variables but sat were transformed.
The PCA summary indicates that that the first two principal
components (PCs) accounted for ~81% of the total variance. The loadings
show that sat and elevation_trans were moderately
negatively correlated and that temp_trans and DO_trans
were moderately positively correlated with PC1, while ph_trans
was strongly positively correlated with PC2.
A scree plot was drawn to determine which PCs to plot.
Figure 10. The scree plot of the PCA is presented. The five principal components are on the x-axis and their inertia values (i.e., eigenvalues) are on the y-axis. A broken stick model is overlaid on top of the scree plot.
This plot shows that 1) the largest drop in eigenvalues occurred from
PC1-PC2; 2) PC1 and PC2 were above the mean eigenvalue; and 3) only PC1
was above the broken stick model.
Given that PC2 was above the mean eigenvalue and that it accounted for ~22% of the total variance, a distance-preserving biplot of PC1 and PC2 from the alpine lakes BMI data was constructed.
Figure 11. A biplot of the alpine lake sites is drawn on the first two principal components. The environmental variables are overlaid.
The biplot of sites on PC1 and PC2 axes reflect the loadings
described above. sat and elevation_trans were
moderately negatively correlated and temp_trans and
DO_trans were moderately positively correlated with PC1, while
ph_trans was strongly positively correlated with PC2. There
appear to be four groups of sites: 1) low elevation and high pH, 2) near
the origin, 3) high temperature and pH, and 4) low pH.
Categorical variables were used to distinguish sites on biplots. fish_presence, lotic, and local_site are displayed in Figure 12.
Figure 12. Biplots of the alpine lake sites are drawn on the first two principal components with environmental variables overlaid. Site locations are colored by A) fish_presence, B) lotic, or C) local_site status.
These biplots indicate strong discrimination among sites according to
categorical variables. For example, fish_presence tended to
diverge along PC1. Sites with fish appear to have higher temperatures
and dissolved oxygen levels, while sites without fish tend to have
higher dissolved oxygen saturation and occur at greater elevation.
The PC scores were tested for differences among groups of categorical variables using ANOVAs. First, an omnibus test was run to determine whether differences occurred regardless of PC axis. If a significant test resulted, a by-axis test was conducted to ascertain where the differences occurred with respect to each PC axis.
The omnibus test results indicated that there were significant differences among sites according to categorical variables local_site, fish_presence, and lotic but not for location or shore (Table 6). Thus, by-axis tests were conducted on the former three variables.
The by-axis test results indicated that site scores differed
significantly among local_sites along both PC1 and PC2 (Table
7). This result was supported by the biplot, which shows clear clusters
that form according to local_site groups along both PC axes.
Sites differed significantly between fish_presence levels along
PC1 only, which was corroborated by the biplot. Specifically, sites
without fish (i.e., fish_presence = 0) tended to have negative PC1
values, while sites with fish had positive values. This suggested that
sites without fish tended to occur at greater elevation and have higher
dissolved oxygen saturation levels, while sites with fish were driven by
warmer temperatures and higher dissolved oxygen levels. Furthermore,
site scores along PC2 encompass a wide range of values regardless of
fish_presence category, suggesting that ph_trans was a
less important driver. Finally, site score patterns according to
lotic class appeared to have an opposite pattern relative to
fish_presence status. Here discrimination among sites by
lotic group was more apparent along PC2 with lentic sites
tending to have positive PC2 scores and lotic sites with more negative
PC2 scores. Separation along PC1 was not apparent. This suggested that,
on average, lentic sites had a greater pH than lotic sites.
The first step of an NMDS analysis is to choose the type of distance
metric. The square-roots of Bray-Curtis dissimilarities were chosen.
Bray-Curtis dissimilarities are non-metric but can produce negative
eigenvalues; thus, their square roots were used for this analysis
(Shipley, 2021).
The second step of an NMDS analysis is to choose the number of dimensions. A series of NMDS models are fit onto the site scores of the first two principal coordinates analysis (PCoA) axes while successively increasing the number of dimensions. Stress levels are observed as the number of dimensions (axes) increase using a scree plot.
Figure 13. The scree plot for a non-metric multidimensional scaling (NMDS) analysis of the alpine lake BMI data has the number of dimensions on the x-axis and the corresponding stress levels on the y-axis.
This plot (Fig. 13) shows that there was a large decrease in stress
from one to two dimensions before more gradual decline in stress levels.
The stress level at two dimensions was 0.0807. This pattern suggests
using two dimensions for the NMDS analysis.
The sensitivity of solutions for each site value was determined using a jackknife approach (i.e., assess differences in final configuration if each site was left out).
##
## Call: jackmds.smacofB(object = bmi_nmds_default)
##
## SMACOF Jackknife
## Number of objects: 22
## Value loss function: 1.2808
## Number of iterations: 17
##
## Stability measure: 0.9966
## Cross validity: 0.9999
## Dispersion: 0.0035
Figure 14. A jacknife plot of BMI data is presented along the first two NMDS axes.
The stability of measure and cross validity values were very close to
1, indicating that they do not differ within rounding error between
jackknifed runs. Only two sites (i.e., 1, 14) show significant change
when jackknifing (Fig. 14). These results indicate that the NMDS
solution obtained when using the site scores on the first two PCoA axes
as starting values is close to or at the best NMDS solution, which is
highly stable (Shipley, 2021).
A Shepard diagram was constructed to see how Bray-Curtis dissimilarities have been converted into Euclidean distances in the two-dimensional NMDS ordination.
Figure 15. A Shepard plot of Bray-Curtis dissimilarities (x-axis) and Euclidean distances (y-axis) of alpine lake BMI data is shown above.
The Shepard diagram shows that there was roughly a linear
relationship from the origin until ~0.7 Bray-Curtis dissimilarities
before transitioning to an approximately exponential relationship (Fig.
15). This pattern indicated that pairs of sites with similar familial
composition of BMI (i.e., small Bray-Curtis dissimilarities) were placed
closer together in the NMDS solution, while pairs of sites with more
dissimilar familial composition were placed further apart.
Biplots of sites were plotted on NMDS axes 1 and 2, and points were colored by groups within the five categorical variables: local_site, location, fish_presence, lotic, and shore.
Figure 16. These five NMDS biplots contain the alpine lake sites plotted on the first two NMDS axes. Site locations are colored by different categorical environmental variables per the five panels: A) local_site, B) location, C) fish_presence, D) lotic, and E) shore.
Out of the five categorical variables, only the two binary
variables–fish_presence and lotic–showed
discrimination of sites along NMDS axes 1 and 2. Thus, numerical
environmental variables were drawn onto their biplots to improve
analysis and interpretation.
Numerical variables can be related to ordination axes in two ways: 1)
predict environmental variables given their site scores on ordination
axes and 2) predict site scores on ordination axes given their
environmental variables.
The NMDS plot was constructed with sites color coded by fish_presence category and environmental variables overlain either as dependent variables or predictors.
Figure 17. NMDS biplots of the alpine lake sites with points colored by fish_presence status are drawn. Environmental variables individually regressed on the first two NMDS axes are presented in (A), while each NMDS axis regressed on environmental variables are presented in (B).
Fig. 17A shows that as NMDS1 values increase, so does ph_trans and to a lesser extent, oxygen and nitrate_trans, while elevTemp decreases. As NMDS2 values increase, so does ph_trans and to a smaller degree, elevTemp and oxygen, while nitrate_trans decreases. Site scores where fish were present (in upper left) were positively associated with elevTemp and negatively associated with nitrate_trans.
Statistical analysis of these relationships shows that only ph was significant (Table 9).
Fig. 17B shows that as NMDS1 values increase, so does oxygen, and to a lesser extent, ph_trans and nitrate_trans, while elevTemp decreases. As NMDS2 values increase, so does ph_trans and elevTemp, while oxygen and nitrate_trans decrease. Site scores where fish were present (~ 0, 0.5) were positively associated with elevTemp and negatively associated with oxygen.
Statistical analyses were run on the regressions of NMDS axes 1 and 2 on environmental variables.
These results indicated that for NMDS1, ph_trans,
oxygen, and elevTemp were significant predictors
(Table 10). Regarding NMDS2, no partial regression coefficients of
environmental variables were significant, although ph_trans was
marginally significant (p < .078).
A second set of biplots with numerical environmental variables was constructed for site scores colored by lotic category.
Figure 18. NMDS biplots of the alpine lake sites with points colored by fish_presence status are drawn. Environmental variables individually regressed on the first two NMDS axes are presented in (A), while each NMDS axis regressed on environmental variables are presented in (B).
Interpretation of arrows and their statistical significance is the
same for these plots as they were for Fig. 17. However, Fig. 18 shows
that a cluster of lentic sites (i.e., where lotic = 0) were positively
associated with elevTemp and negatively associated with
nitrate_trans and that a cluster of lotic sites were
positively associated with elevTemp and negatively associated
with ph_trans.
The sample analyzed for this report comprised 2,353 BMIs and
associated environmental data collected from 22 sites in four alpine
lakes in Niwot Ridge, CO, from 7/31 - 8/19/18. Dipterans were by far the
most populous order; they composed more than 80% of all individuals
collected and no other order exceeded 10% of total BMI abundance. Green
Lake 5 had, on average, the highest DO and sat
relative to the other four local_sites. Fish were present only
in Lake Albion and Green Lake 1, and the latter was the only
local_site without a lotic location. temp was
greatest, on average, in GL1 sites and lowest in GL5 sites.
Unsurprisingly, sat and DO were strongly positively
correlated. Both of these variables, elevation, and
nitrate were each significantly negatively correlated with
temp. DO and sat were each significantly
positively correlated with elevation. When correlation analyses
were performed separately on BMI abundances and the six continuous
environmental variables, only ph was a significant correlate.
nitrate was dropped, and all remaining continuous
environmental variables but sat underwent Box-Cox
transformations to fulfill the assumptions of PCA. A summary of the PCA
indicated that PC1 and PC2 explained roughly 59% and 22%, respectively,
of the total variance. sat and elevation_trans were
negatively associated with PC1, while temp_trans and
DO_trans were positively associated with this principal
component. ph_trans was strongly positively associate with PC2.
PC1 and PC2 were selected for visualization and further analysis
following interpretation of the Scree plot. The distance-preserving
biplot of these data suggested four groups of sites: 1) low elevation
and high pH, 2) near the origin (no associations), 3) high temperature
and pH, and 4) low pH. After incorporating categorical environmental
variables, PC scores significantly varied among local_sites and
lotic and fish_presence categories overall. By-axis
tests indicated that PC scores varied by local_site for both
PC1 and PC2, by fish_presence category for PC1 only, and by
lotic category for PC2 only.
Pearson correlations of all pairs of continuous environmental
variables were performed to consolidate highly correlated variable pairs
into composite variables; this was necessary for
sat-DO, which became oxygen, and
elevation-temp, which became elevTemp, as
each pair was highly correlated. An NMDS analysis was conducted on
square-roots of Bray-Curtis dissimilarities of these data. Biplots of
sites were plotted on NMDS axes 1 and 2. After coloring sites by each
categorical environmental variable (in separate plots), only
fish_presence and lotic showed discrimination of sites
along these two axes. Thus, a second set of plots (two per variable)
were constructed where continuous environmental variables were added as
arrows, either showing variables regressed on ordination axes or the
converse, for both fish_presence and lotic. When
ordination axes were regressed on environmental variables, sites scores
with fish present were positively associated with elevTemp and
negatively associated with oxygen. Furthermore, a group of
lentic sites was positively correlated with elevTemp and
negatively associated with nitrate_trans, whereas lotic sites
were also positively associated with elevTemp and were
negatively associated with ph_trans.
Loria, K. and S. Yevak. 2019. Macroinvertebrate trait, abundance and site quality data from the Green Lakes Valley, 2018 ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/038a8ab20a6ffa2044b885c6a999112e (Accessed 2023-04-16).
Shipley, B. 2021. Ordination methods for biologists: a non-mathematical introduction using R. Quebec: BS Publishing.